Project Based

CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The process engineers may then use these signals to determine the key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals that impact the yield type can be identified.

DATA DESCRIPTION: sensor-data.csv : (1567, 592). The data consists of 1567 examples, each with 591 features. The dataset represents a selection of such features, where each example is a single production entity with its associated measured features, and the label is a simple pass/fail yield from in-house line testing. In the target column, –1 corresponds to a pass and 1 corresponds to a fail; the timestamp records when that specific test point was measured.

PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model.

Importing the libraries

Import and warehouse data
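A minimal sketch of the import-and-load step. The column name `Pass/Fail` and the on-disk layout of `sensor-data.csv` are assumptions; since the real file is not available here, the snippet writes a small synthetic stand-in first so that it runs anywhere.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for sensor-data.csv (hypothetical column names):
# in the real project this file already exists on disk.
stand_in = pd.DataFrame(rng.normal(size=(20, 5)),
                        columns=[f"sig_{i}" for i in range(5)])
stand_in["Pass/Fail"] = rng.choice([-1, 1], size=20, p=[0.9, 0.1])
stand_in.to_csv("sensor-data.csv", index=False)

# The actual loading step: read the CSV and inspect shape and class balance.
df = pd.read_csv("sensor-data.csv")
print(df.shape)
print(df["Pass/Fail"].value_counts())
```

On the real data, `df.shape` would be `(1567, 592)` and `value_counts()` would reveal the strong pass/fail imbalance that motivates the sampling strategies later on.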

Observations:

Data Cleansing
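A sketch of two typical cleansing steps for this kind of sensor data, shown on a small synthetic frame: dropping zero-variance columns (a constant sensor reading carries no signal) and mean-imputing the remaining missing values. Column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(10, 4)), columns=list("abcd"))
df["const"] = 5.0            # a zero-variance sensor column
df.loc[::3, "a"] = np.nan    # scatter some missing readings

# Drop columns with a single unique value: they cannot help the model.
nunique = df.nunique(dropna=True)
df = df.drop(columns=nunique[nunique <= 1].index)

# Impute remaining gaps with the column mean.
df = df.fillna(df.mean())
```

Mean imputation is a simple default; median imputation is an alternative when a sensor's distribution is heavily skewed.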

Observations:

Observations:

Data Analysis and Visualisation

Observations:

Observations:

From describe() and the visual analysis we could observe:

Observations:

NOTE: My laptop cannot render these plots in a reasonable time. The snippets below can be uncommented at any time to visualise these facts.

Data pre-processing
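A sketch of the core pre-processing steps on synthetic data: a stratified train/test split (to preserve the pass/fail ratio in both folds) and standardisation, with the scaler fitted on the training fold only to avoid leakage.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))   # stand-in features
y = rng.choice([-1, 1], size=100)                    # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the scaler on the training fold only, then apply the same
# transform to the test fold.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```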

Observations:

Observations:

Observations:

Observations:

We can also carry out hypothesis testing to check whether individual features differ significantly between the pass and fail classes.
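One common form such a test takes here is Welch's two-sample t-test on a single feature, comparing its values for pass (–1) and fail (+1) entities; a small p-value suggests the feature's mean differs between the classes. The data below is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical feature values for pass (-1) and fail (+1) entities.
passed = rng.normal(0.0, 1.0, size=200)
failed = rng.normal(0.5, 1.0, size=40)

# Welch's t-test (unequal variances): does the feature mean differ?
t_stat, p_value = stats.ttest_ind(passed, failed, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Repeating this per feature gives a rough ranking of which signals separate the classes, though with 591 features some multiple-testing correction would be prudent.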

Observations:

Model training, testing and tuning

Observations:

Not Sampled Data

Observations:

Random Under-Sampling

Randomly decreasing the frequency of the majority target class.
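The step above is usually done with imblearn's `RandomUnderSampler`; a dependency-free sketch using `sklearn.utils.resample` on synthetic imbalanced data:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = np.where(rng.random(100) < 0.9, -1, 1)   # imbalanced: mostly pass

maj, mino = X[y == -1], X[y == 1]
# Sample the majority (pass) class down to the minority-class count.
maj_down = resample(maj, replace=False, n_samples=len(mino), random_state=0)

X_bal = np.vstack([maj_down, mino])
y_bal = np.array([-1] * len(maj_down) + [1] * len(mino))
```

Under-sampling discards majority examples, so it trades information loss for balance; it works best when the majority class is large and redundant.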

Observations:

Synthetic Minority Over-Sampling Technique (SMOTE)

This increases the frequency of the minority class by synthesising new examples along the lines joining minority samples and their k nearest minority neighbours.
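In practice this is imblearn's `SMOTE`; to keep the sketch dependency-free, a minimal illustrative version (the helper `smote_minority` is hypothetical, not a library function) interpolating between minority samples and their k nearest neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_minority(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: synthesise n_new points by interpolating
    between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # pick a random true neighbour
        lam = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

rng = np.random.default_rng(5)
X_min = rng.normal(size=(20, 4))            # stand-in minority (fail) samples
synthetic = smote_minority(X_min, n_new=30)
```

Unlike random over-sampling, SMOTE does not duplicate rows verbatim, which reduces the risk of the classifier memorising repeated minority points.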

Observations:

Random Over-Sampling

Randomly increasing the frequency of the minority target class.
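The mirror image of under-sampling, again sketched with `sklearn.utils.resample` on synthetic data (imblearn's `RandomOverSampler` is the usual tool): the minority class is sampled with replacement up to the majority count.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = np.where(rng.random(100) < 0.9, -1, 1)   # imbalanced: mostly pass

maj, mino = X[y == -1], X[y == 1]
# Sample the minority (fail) class up, with replacement, to the majority count.
mino_up = resample(mino, replace=True, n_samples=len(maj), random_state=0)

X_bal = np.vstack([maj, mino_up])
y_bal = np.array([-1] * len(maj) + [1] * len(mino_up))
```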

Observations:

Gaussian Naive Bayes on Original Dataset
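A sketch of the baseline fit, on a synthetic imbalanced stand-in for the sensor data. With no sampling, plain accuracy can look high simply because the pass class dominates, which is why the fail-class F1 is worth printing alongside it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced stand-in: ~90% pass, ~10% fail.
X, y = make_classification(n_samples=400, n_features=20, weights=[0.9],
                           random_state=0)
y = np.where(y == 0, -1, 1)                 # match the dataset's -1/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

gnb = GaussianNB().fit(X_tr, y_tr)
print("accuracy:", gnb.score(X_te, y_te))
print("fail-class F1:", f1_score(y_te, gnb.predict(X_te), pos_label=1))
```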

Observations:

Gaussian Naive Bayes on Under-Sampled Data

Observations:

LightGBM on SMOTE sampled Dataset with Random Search CV

We will fit a LightGBM classifier on the SMOTE-sampled data, using RandomizedSearchCV to tune its hyper-parameters.

Observations:

RandomForest on Random over sampled Dataset with Random Search CV

We will fit a RandomForest classifier on the over-sampled data, using RandomizedSearchCV to tune its hyper-parameters.
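A sketch of the same pattern with scikit-learn's `RandomForestClassifier`, on synthetic data standing in for the over-sampled training set; the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in for the randomly over-sampled training set.
X, y = make_classification(n_samples=300, n_features=20, random_state=1)

param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_)
```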

Observations:

SVM with under-sampled data and Grid Search CV

We will fit an SVM classifier on the under-sampled data, using GridSearchCV to tune its hyper-parameters.
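A sketch on synthetic data standing in for the (small, balanced) under-sampled set. The SVM is wrapped in a pipeline with a scaler, since SVMs are sensitive to feature scale; the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the under-sampled training set.
X, y = make_classification(n_samples=200, n_features=20, random_state=2)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["rbf", "linear"],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

Grid search is affordable here precisely because under-sampling shrinks the training set; on the full data its cost grows with every grid dimension.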

Observations:

Algorithm Comparison

ROC-AUC plot
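A sketch of computing the data behind such a plot, using a synthetic dataset and a logistic-regression stand-in for whichever fitted model is being compared; `roc_curve` yields the (FPR, TPR) points and `roc_auc_score` the scalar used to rank the algorithms.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.3f}")
# plt.plot(fpr, tpr) would draw the curve; left commented, as in the notebook.
```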

Observations:

Principal Component Analysis (PCA)

We will perform PCA to reduce the dimensionality and check whether model performance improves.
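A sketch of the reduction on synthetic data. Standardising first matters because PCA is variance-based; the 0.95 threshold (keep enough components to explain 95% of the variance) is a common choice, not a value from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the cleaned sensor matrix.
X, _ = make_classification(n_samples=200, n_features=50, random_state=4)

# Scale first: otherwise high-variance sensors dominate the components.
X_s = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining 95% of the variance.
pca = PCA(n_components=0.95, random_state=0)
X_reduced = pca.fit_transform(X_s)
print(X.shape, "->", X_reduced.shape)
print("explained variance:", pca.explained_variance_ratio_.sum())
```

On the real 591-feature matrix, the same call typically cuts the dimensionality substantially, which is what makes the grid searches below cheaper to rerun.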

Observations:

Observations:

Random Under-Sampling

Randomly decreasing the frequency of the majority target class.

Synthetic Minority Over-Sampling Technique (SMOTE)

This increases the frequency of the minority class by synthesising new examples along the lines joining minority samples and their k nearest minority neighbours.

Random Over-Sampling

Randomly increasing the frequency of the minority target class.

Gaussian Naive Bayes on Original Dataset

Observations:

Gaussian Naive Bayes on Under-Sampled Data

Observations:

Logistic Regression on Original Dataset with Grid Search CV

We will fit a LogisticRegression classifier on the original data, using GridSearchCV to tune its hyper-parameters.
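A sketch on synthetic data; including `class_weight="balanced"` in the grid is one way to let the search itself decide whether re-weighting the rare fail class helps. The grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in for the (PCA-reduced) original training set.
X, y = make_classification(n_samples=300, n_features=20, random_state=5)

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [None, "balanced"],   # lets CV decide on re-weighting
}
search = GridSearchCV(LogisticRegression(max_iter=2000), param_grid,
                      cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
```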

Observations:

LightGBM on SMOTE sampled Dataset with Random Search CV

We will fit a LightGBM classifier on the SMOTE-sampled, PCA-reduced data, using RandomizedSearchCV to tune its hyper-parameters.

Observations:

RandomForest on Random over sampled Dataset with Random Search CV

We will fit a RandomForest classifier on the over-sampled, PCA-reduced data, using RandomizedSearchCV to tune its hyper-parameters.

Observations:

SVM with under-sampled data and Grid Search CV

We will fit an SVM classifier on the under-sampled, PCA-reduced data, using GridSearchCV to tune its hyper-parameters.

Observations:

Algorithm Comparison after dimension reduction by PCA

ROC-AUC plot

Observations:

Final Models to use

Importing Unseen Data

Gaussian Naive Bayes on Original Dataset

Random Forest with Over-Sampled Data

Observations and Conclusions:

NOTE: I wanted to take this model further with more visual analysis and broader hyper-parameter searches, but could not due to hardware constraints. Hands-on guidance on optimising hardware usage for faster runs would be beneficial.